YodaNN: An Architecture for Ultra-Low Power Binary-Weight CNN Acceleration
Convolutional neural networks (CNNs) have revolutionized the world of
computer vision over the last few years, pushing image classification beyond
human accuracy. The computational effort of today's CNNs requires power-hungry
parallel processors or GP-GPUs. Recent developments in CNN accelerators for
system-on-chip integration have reduced energy consumption significantly.
Unfortunately, even these highly optimized devices are above the power envelope
imposed by mobile and deeply embedded applications and face hard limitations
caused by CNN weight I/O and storage. This prevents the adoption of CNNs in
future ultra-low power Internet of Things end-nodes for near-sensor analytics.
Recent algorithmic and theoretical advancements enable competitive
classification accuracy even when limiting CNNs to binary (+1/-1) weights
during training. These new findings bring major optimization opportunities in
the arithmetic core by removing the need for expensive multiplications, as well
as reducing I/O bandwidth and storage. In this work, we present an accelerator
optimized for binary-weight CNNs that achieves 1510 GOp/s at 1.2 V on a core
area of only 1.33 MGE (Million Gate Equivalent) or 0.19 mm² and with a power
dissipation of 895 µW in UMC 65 nm technology at 0.6 V. Our accelerator
significantly outperforms the state of the art in terms of energy and area
efficiency, achieving 61.2 TOp/s/W at 0.6 V and 1135 GOp/s/MGE at 1.2 V,
respectively.
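As an illustration of why binary (+1/-1) weights remove the need for multiplications, here is a minimal C sketch; the function name, data layout, and bit packing are assumptions for illustration only, not YodaNN's actual datapath.

#include <stdint.h>

/* Illustrative sketch (not YodaNN's datapath): with weights restricted to
 * +1/-1 and stored one bit per weight (1 -> +1, 0 -> -1), every
 * multiply-accumulate collapses into a conditional add or subtract, and
 * weight storage and I/O shrink compared to 16- or 32-bit weights. */
int32_t binary_weight_dot(const int16_t *act,     /* input activations     */
                          const uint8_t *w_bits,  /* packed binary weights */
                          int n)
{
    int32_t acc = 0;
    for (int i = 0; i < n; i++) {
        int plus = (w_bits[i / 8] >> (i % 8)) & 1;
        acc += plus ? act[i] : -act[i];           /* no multiplier required */
    }
    return acc;
}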
Hyperdrive: A Multi-Chip Systolically Scalable Binary-Weight CNN Inference Engine
Deep neural networks have achieved impressive results in computer vision and
machine learning. Unfortunately, state-of-the-art networks are extremely
compute and memory intensive which makes them unsuitable for mW-devices such as
IoT end-nodes. Aggressive quantization of these networks dramatically reduces
the computation and memory footprint. Binary-weight neural networks (BWNs)
follow this trend, pushing weight quantization to the limit. Hardware
accelerators for BWNs presented up to now have focused on core efficiency,
disregarding I/O bandwidth and system-level efficiency that are crucial for
deployment of accelerators in ultra-low power devices. We present Hyperdrive: a
BWN accelerator that dramatically reduces I/O bandwidth by exploiting a novel
binary-weight streaming approach. It supports arbitrarily sized convolutional
neural network architectures and input resolutions by exploiting the natural
scalability of its compute units at both chip and system level, arranging
Hyperdrive chips systolically in a 2D mesh that processes the entire feature map
in parallel. Hyperdrive achieves 4.3 TOp/s/W system-level efficiency (i.e.,
including I/Os), 3.1x higher than state-of-the-art BWN accelerators, even though
its core uses resource-intensive FP16 arithmetic for increased robustness.
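To make the I/O argument concrete, the following C sketch compares the data volume of streaming 1-bit weights against streaming FP16 feature maps for one example convolutional layer; the layer dimensions and names are made up for illustration and do not come from the Hyperdrive evaluation.

#include <stdio.h>

/* Back-of-the-envelope comparison with a hypothetical layer shape: keeping
 * the feature maps resident on-chip and streaming only 1-bit weights moves
 * far less data across the chip boundary than streaming 16-bit feature
 * maps, which is the intuition behind Hyperdrive's system-level efficiency. */
int main(void)
{
    const long in_ch = 128, out_ch = 128, k = 3;       /* example filter bank */
    const long H = 112, W = 112;                       /* example feature map */

    long weight_bits  = in_ch * out_ch * k * k;             /* 1 bit each  */
    long featmap_bits = (in_ch + out_ch) * H * W * 16;      /* FP16 in+out */

    printf("binary weights streamed: %ld kbit\n", weight_bits / 1000);
    printf("feature maps streamed:   %ld kbit\n", featmap_bits / 1000);
    return 0;
}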
MemPool: A Scalable Manycore Architecture with a Low-Latency Shared L1 Memory
Shared L1 memory clusters are a common architectural pattern (e.g., in
GPGPUs) for building efficient and flexible multi-processing-element (PE)
engines. However, it is a common belief that these tightly-coupled clusters
would not scale beyond a few tens of PEs. In this work, we tackle scaling
shared L1 clusters to hundreds of PEs while supporting a flexible and
productive programming model and maintaining high efficiency. We present
MemPool, a manycore system with 256 RV32IMAXpulpimg "Snitch" cores featuring
application-tunable functional units. We designed and implemented an efficient
low-latency PE to L1-memory interconnect, an optimized instruction path to
ensure each PE's independent execution, and a powerful DMA engine and system
interconnect to stream data in and out. MemPool is easy to program, with all
the cores sharing a global view of a large, multi-banked L1 scratchpad memory,
accessible within at most five cycles in the absence of conflicts. We provide
multiple runtimes to program MemPool at different abstraction levels and
illustrate its versatility with a wide set of applications. MemPool runs at 600
MHz (60 gate delays) in typical conditions (TT/0.80 V/25 °C) in 22 nm FDX
technology and achieves a performance of up to 229 GOPS or 192 GOPS/W with less
than 2% of execution stalls.
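The shared-L1 programming model described above can be pictured with a small bare-metal kernel; the runtime helpers used here (core_id, num_cores, barrier) are hypothetical placeholders for whatever the actual MemPool runtimes expose.

#include <stdint.h>

/* Hypothetical runtime hooks; real MemPool runtimes provide equivalents,
 * but these exact names and signatures are placeholders. */
extern uint32_t core_id(void);    /* index of the calling core, 0..N-1 */
extern uint32_t num_cores(void);  /* e.g., 256                         */
extern void     barrier(void);    /* synchronize all cores             */

/* Because every core sees the same multi-banked L1 scratchpad, work is
 * split by simply striding over a shared array: no per-core copies are
 * needed, only a barrier before the result is consumed. */
void parallel_scale(int32_t *data, int32_t factor, uint32_t len)
{
    for (uint32_t i = core_id(); i < len; i += num_cores())
        data[i] *= factor;
    barrier();
}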
Ara2: Exploring Single- and Multi-Core Vector Processing with an Efficient RVV1.0 Compliant Open-Source Processor
Vector processing is highly effective in boosting processor performance and
efficiency for data-parallel workloads. In this paper, we present Ara2, the
first fully open-source vector processor to support the RISC-V V 1.0 frozen
ISA. We evaluate Ara2's performance on a diverse set of data-parallel kernels
for various problem sizes and vector-unit configurations, achieving an average
functional-unit utilization of 95% on the most computationally intensive
kernels. We pinpoint performance boosters and bottlenecks, including the scalar
core, memories, and vector architecture, providing insights into the main
vector architecture's performance drivers. Leveraging the openness of the
design, we implement Ara2 in 22 nm technology, characterize its PPA metrics on
various configurations (2-16 lanes), and analyze its microarchitecture and
implementation bottlenecks. Ara2 achieves a state-of-the-art energy efficiency
of 37.8 DP-GFLOPS/W (0.8 V) and a clock frequency of 1.35 GHz (critical path: ~40
FO4 gates). Finally, we explore the performance and energy-efficiency
trade-offs of multi-core vector processors: we find that multiple vector cores
help overcome the scalar core issue-rate bound that limits short-vector
performance. For example, a cluster of eight 2-lane Ara2 (16 FPUs) achieves
more than 3x better performance than a 16-lane single-core Ara2 (16 FPUs) when
executing a 32x32x32 matrix multiplication, with 1.5x improved energy
efficiency.
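Because Ara2 implements the frozen RVV 1.0 ISA, data-parallel kernels like those in the evaluation can be expressed with the standard RISC-V vector C intrinsics; the strip-mined AXPY below is an illustrative sketch assuming a toolchain that supports the v1.0 intrinsics API, not code from the paper.

#include <riscv_vector.h>
#include <stddef.h>

/* Strip-mined y = a*x + y using RVV 1.0 C intrinsics: vsetvl picks how many
 * elements are processed per iteration from the hardware vector length, so
 * the same binary runs on 2-lane and 16-lane Ara2 configurations. */
void saxpy(size_t n, float a, const float *x, float *y)
{
    while (n > 0) {
        size_t vl = __riscv_vsetvl_e32m8(n);              /* elements this pass */
        vfloat32m8_t vx = __riscv_vle32_v_f32m8(x, vl);   /* load x             */
        vfloat32m8_t vy = __riscv_vle32_v_f32m8(y, vl);   /* load y             */
        vy = __riscv_vfmacc_vf_f32m8(vy, a, vx, vl);      /* y += a * x         */
        __riscv_vse32_v_f32m8(y, vy, vl);                 /* store y            */
        n -= vl; x += vl; y += vl;
    }
}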
XNORBIN: A 95 TOp/s/W Hardware Accelerator for Binary Convolutional Neural Networks
Deploying state-of-the-art CNNs requires power-hungry processors and off-chip
memory. This precludes the implementation of CNNs in low-power embedded
systems. Recent research shows that CNNs can sustain extreme quantization,
binarizing their weights and intermediate feature maps, thereby saving 8-32x
memory and collapsing energy-intensive sums of products into XNOR-and-popcount
operations.
We present XNORBIN, an accelerator for binary CNNs with computation tightly
coupled to memory for aggressive data reuse. Implemented in UMC 65 nm technology,
XNORBIN achieves an energy efficiency of 95 TOp/s/W and an area efficiency of
2.0 TOp/s/MGE at 0.8 V.
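The XNOR-and-popcount reformulation mentioned above is easy to write out in C; the sketch below assumes activations and weights are already binarized and packed one bit per value (1 meaning +1, 0 meaning -1) and relies on a GCC/Clang-style __builtin_popcount.

#include <stdint.h>

/* With both weights and feature maps binarized to +1/-1 and packed one bit
 * per value, the dot product of 32 element pairs reduces to XNOR plus
 * popcount: matching bits contribute +1, differing bits contribute -1. */
static inline int32_t bin_dot32(uint32_t act_bits, uint32_t wgt_bits)
{
    uint32_t xnor = ~(act_bits ^ wgt_bits);       /* 1 where signs match    */
    int32_t matches = __builtin_popcount(xnor);   /* number of +1 products  */
    return 2 * matches - 32;                      /* (+1)*m + (-1)*(32 - m) */
}

/* Accumulating over a whole filter: n_words packed words per dot product. */
int32_t bin_dot(const uint32_t *act, const uint32_t *wgt, int n_words)
{
    int32_t acc = 0;
    for (int i = 0; i < n_words; i++)
        acc += bin_dot32(act[i], wgt[i]);
    return acc;
}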
YodaNN: An ultra-low power convolutional neural network accelerator based on binary weights
Convolutional Neural Networks (CNNs) have revolutionized the world of image classification over the last few years, pushing computer vision beyond human accuracy. The computational effort of today's CNNs requires power-hungry parallel processors and GP-GPUs. Recent efforts in designing CNN Application-Specific Integrated Circuits (ASICs) and accelerators for System-on-Chip (SoC) integration have achieved very promising results. Unfortunately, even these highly optimized engines are still above the power envelope imposed by mobile and deeply embedded applications and face hard limitations caused by CNN weight I/O and storage. On the algorithmic side, highly competitive classification accuracy can be achieved by properly training CNNs with binary weights. This novel algorithmic approach brings major optimization opportunities in the arithmetic core by removing the need for expensive multiplications, as well as in the weight storage and I/O costs. In this work, we present a HW accelerator optimized for BinaryConnect CNNs that achieves 1510 GOp/s on a core area of only 1.33 MGE and with a power dissipation of 153 mW in UMC 65 nm technology at 1.2 V. Our accelerator outperforms the state of the art in terms of ASIC energy efficiency as well as area efficiency, with 61.2 TOp/s/W and 1135 GOp/s/MGE, respectively.